
Conversation

@nabinchha (Contributor) commented Dec 2, 2025:

Opening a draft PR for initial feedback. I've currently extended the modality to support embedding generation, but we can follow the same pattern to support image generation.

The major change is the need to break InferenceParameters out into generation-type-specific classes. Changes include renaming the existing InferenceParameters to CompletionInferenceParameters, with backwards compatibility and a deprecation warning.
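
One common shape for that backwards-compatible rename is a module-level alias that warns on access (a sketch, assuming the new class lives in the same module; the actual shim in the PR may differ):

import warnings

# Hypothetical module-level shim (PEP 562): importing the old name still
# works but emits a DeprecationWarning pointing at the new class.
def __getattr__(name: str):
    if name == "InferenceParameters":
        warnings.warn(
            "InferenceParameters is deprecated; use CompletionInferenceParameters instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return CompletionInferenceParameters
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")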

I'm working on expanding to image generation in the same PR and will mark this as ready for review when that's done.

Here's an example of what the workflow looks like for embeddings:

import json
import pandas as pd
from data_designer.essentials import (
    DataDesigner,
    DataDesignerConfigBuilder,
    EmbeddingColumnConfig,
    EmbeddingInferenceParameters,
    ExpressionColumnConfig,
    GenerationType,
    ModelConfig,
)

model_configs = [
    ModelConfig(
        alias="nvidia-embedder",
        model="nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2",
        provider="nvidia",
        generation_type=GenerationType.EMBEDDING,
        inference_parameters=EmbeddingInferenceParameters(
            extra_body={"input_type": "query"},
        ),
    ),
    ModelConfig(
        alias="openai-embedder",
        model="text-embedding-3-small",
        provider="openai",
        inference_parameters=EmbeddingInferenceParameters(
            dimensions=768,
            encoding_format="float",
        ),
    ),
]

config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

with open("dummy_generated_data.json", "r") as f:
    full_generation_data = json.load(f)

config_builder.with_seed_dataset(
    dataset_reference=DataDesigner.make_seed_reference_from_dataframe(
        pd.DataFrame(full_generation_data),
        "tmp_dedup.json"
    ),
    sampling_strategy="ordered"
)

config_builder.add_column(
    ExpressionColumnConfig(
        name="questions",
        expr="{% for pair in qa_generation.pairs %}{{ pair.question }}\n{% endfor %}"
    )
)

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_nvidia",
        model_alias="nvidia-embedder",
        target_column="questions",
        chunk_pattern="\n+"
    )
)

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_openai",
        model_alias="openai-embedder",
        target_column="questions",
        chunk_pattern="\n+"
    )
)

data_designer = DataDesigner()
result = data_designer.preview(config_builder)
result.display_sample_record()

if response.data and len(response.data) == len(input_texts):
    return [data["embedding"] for data in response.data]
else:
    raise ValueError(f"Expected {len(input_texts)} embeddings, but received {len(response.data)}")
A reviewer (Contributor) commented on the snippet above:

There might be an issue if response.data is None?

@nabinchha (Contributor, Author) replied Dec 2, 2025:

Based on the documentation, upon calling .embedding(...) we're either going to get an EmbeddingResponse object or some exception will be raised. And EmbeddingResponse.data is a list... the latter check should kick in if that list is empty, right?
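
For reference, a more defensive variant of that check could look like this (a sketch; the helper name is made up):

def _extract_embeddings(response, input_texts: list[str]) -> list[list[float]]:
    # Guard the None/empty case first, so the error message itself
    # can never fail on len(None).
    if not response.data:
        raise ValueError(f"Expected {len(input_texts)} embeddings, but received none")
    if len(response.data) != len(input_texts):
        raise ValueError(f"Expected {len(input_texts)} embeddings, but received {len(response.data)}")
    return [item["embedding"] for item in response.data]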

input_chunks = [chunk.strip() for chunk in input_chunks if chunk.strip()]
embeddings = self.model.generate_text_embeddings(input_texts=input_chunks)
data[self.config.name] = {
    "embeddings": embeddings,
A reviewer (Contributor) commented on the snippet above:

My understanding is that these 3 fields are added as a JSON in a single column, correct? Considering that embeddings is list[list[float]] (?), is that an issue? Could it be that the JSON is added as a string and the embeddings are encoded sub-optimally, truncated, etc.?

@nabinchha (Contributor, Author) replied:

My understanding is that these 3 fields are added as a JSON in a single column, correct?

Yes, that's correct. I'll double-check what happens when we serialize these as partial results and report back. I think they were serialized correctly, without truncation, when I ran some tests.
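
For reference, a quick round-trip check along those lines (a sketch, not the actual test):

import json
import random

# Embeddings stored as JSON in a single cell should survive a
# serialize/deserialize round trip without float truncation.
embeddings = [[random.random() for _ in range(768)] for _ in range(3)]
cell = json.dumps({"embeddings": embeddings})
restored = json.loads(cell)
assert restored["embeddings"] == embeddings  # json round-trips Python floats exactly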

andreatgretel previously approved these changes Dec 2, 2025.

@eric-tramel (Contributor) commented:

Specifying Model Type

In the current example, the nature of the model must be inferred from the provided inference-parameter type. E.g., the only way to know that the following model is, in fact, an embedding model is to see that the user provided EmbeddingInferenceParameters.

    ModelConfig(
        alias="nvidia-embedder",
        model="nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2",
        provider="nvidia",
        inference_parameters=EmbeddingInferenceParameters(
            extra_body={"input_type": "query"},
        ),
    )

Another natural thing to do would be to create subclasses of ModelConfig, so you can reference models directly by their config type for certain actions (like selecting which inference endpoints they use, etc.):

    EmbeddingModelConfig(
        alias="nvidia-embedder",
        model="nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2",
        provider="nvidia",
        inference_parameters=EmbeddingInferenceParameters(
            extra_body={"input_type": "query"},
        ),
    )

This would also allow the flexibility later to specify inference_parameters as a raw dict and still know which type to cast it to under the hood for input verification.
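
A minimal sketch of that idea (assuming pydantic-style configs and that ModelConfig declares inference_parameters; only the validator is new here):

from pydantic import field_validator

# Hypothetical sketch: the config subtype itself pins down which
# inference-parameter class a raw dict should be cast to.
class EmbeddingModelConfig(ModelConfig):
    @field_validator("inference_parameters", mode="before")
    @classmethod
    def _cast_raw_dict(cls, value):
        if isinstance(value, dict):
            return EmbeddingInferenceParameters(**value)
        return value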

Chunking As A Separate Action

Presently the chunking pattern is specified on the embedding column itself, but joining chunking and embedding into a single step like this seems a bit cumbersome.

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_nvidia",
        model_alias="nvidia-embedder",
        target_column="questions",
        chunk_pattern="\n+"
    )
)

Another option might be to create a separate ChunkTextColumnConfig to specify the chunking operation itself (maybe there are other, more complex flavors which require their own configuration kwargs). Then one can chunk once, and perhaps apply multiple different embedders to the same "chunked text column". E.g.,

config_builder.add_column(
    ChunkTextColumnConfig(
        name="questions_chunked",
        target_column="question",
        chunk_style="newlines"
    )
)

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_A",
        model_alias="model_A",
        target_column="questions_chunked",
    )
)

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_B",
        model_alias="model_B",
        target_column="questions_chunked",
    )
)

@eric-tramel (Contributor) commented:

Another advantage of the separate chunking pattern is that we can use it in other contexts, e.g. if we want to extract a random chunk from a document.

@nabinchha (Contributor, Author) replied:

@eric-tramel thanks for the feedback!

Specifying Model Type

IMO, ModelConfig is a simple enough data structure to contain all information regarding the model. Introducing additional layers on top of it, for example EmbeddingModelConfig and ImageGenerationModelConfig, seems a little complicated, especially because we would still need classes for the generation-type-specific inference parameters, such as EmbeddingInferenceParameters, anyway.

I initially added ModelConfig.generation_type to help distinguish between the types here, which we only need for health checks, but removed it because the type of the inference parameters answers the same question.

Chunking As A Separate Action

This is a good callout... something I was playing around with. The main goal of this generator is to allow generation of multiple embeddings per cell. Splitting the content of the target column via a regex pattern within embedding generation seemed generic enough to me, but I totally see the simplicity of decoupling the two.

In the example you provided, after we use a different generator to do chunking, we still need a way to tell the embedding generator how to discover these different chunks... unless we explicitly say that EmbeddingColumnConfig expects target columns to be a list of strings or a string. Is that what you had in mind?

@eric-tramel (Contributor) replied:

IMO, ModelConfig is a simple enough data structure to contain all information regarding the model. Introducing additional layers to it, for example, EmbeddingModelConfig, ImageGenerationModelConfig seems a little complicated especially because despite that we still need classes to contain generation type specific inference parameters such as EmbeddingInferenceParameters, etc.

I initially added ModelConfig.generation_type to help distinguish between the types here, which we only need to perform health checks, but removed it because the type of inference parameters helped answer the same question.

Gotcha -- but then how is this going to be handled at the CLI when inputting model config parameters? Will the user need to specify the kind of model at that point, too?

$ data-designer config models
...
What model type is this?
    -> llm
       vlm
       embedder

In the example you provided, after we use a different generator to do chunking, we still need a way to tell the embedding generator how to discover these different chunks.... unless we explicitly say that EmbeddingColumnConfig expects target columns to be a list of strings or a string. Is that what you had in mind?

Yep, in this case an EmbeddingColumn task could operate on either a string input or a list of strings (which can be represented by some internal data structure/type matching it). In the case of a list of strings, you get the embeddings of the list of strings; otherwise it's the embedding of the single string.

It would be interesting in the future to consider the option of some kind of explode feature, like in pandas/polars. This would be entirely optional, but could allow a user to flatten the nesting of their dataset if nesting isn't what they want. For instance, the below pattern would allow one to create a dataset of "all embedding chunks" from a set of source documents without needing to unpack or fiddle with nesting structures themselves.

# Start with N documents in the dataset, now chunk.
config_builder.add_column(
    ChunkTextColumnConfig(
        name="document_chunked",
        target_column="document",
        chunk_params={"style": "contiguous", "max_chars": 4096},
    )
)

"""
Now, assume M chunks / document. We now have N rows where each row contains
document_chunked_row_0 = [chunk_0_0, ..., chunk_0_M]
...
document_chunked_row_N = [chunk_N_0, ..., chunk_N_M]

Next, let's say we want to operate on each chunk independent for the rest of the workflow.
"""

config_builder.explode(name="document_chunked")

"""
Now, we have N*M rows in our dataset generated after document_chunked created.

document_chunked_row_0 = chunk_0_0
document_chunked_row_1 = chunk_0_1
...
document_chunked_row_M = chunk_0_M
...
document_chunked_row_NM = chunk_N_M

And perhaps we want to do our embedding now
"""

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_A",
        model_alias="model_A",
        target_column="questions_chunked",
    )
)

@nabinchha (Contributor, Author) commented Dec 3, 2025:

Will the user need to specify the kind of model at that point, too?

Right, something like that. I hadn't actually thought about the CLI, so thanks for raising it! It might make sense to hang generation_type off of the ModelConfig introduced in this PR to make it more explicit:

from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field

class GenerationType(str, Enum):
    CHAT_COMPLETION = "chat-completion"
    EMBEDDING = "embedding"
    IMAGE_GENERATION = "image-generation"

class ModelConfig(BaseModel):  # assuming a pydantic BaseModel, per the Field usage
    alias: str
    model: str
    generation_type: Optional[GenerationType] = GenerationType.CHAT_COMPLETION
    inference_parameters: InferenceParametersT = Field(default_factory=CompletionInferenceParameters)
    provider: Optional[str] = None

    # Validate that the type of inference_parameters matches generation_type

In the CLI, we'll just prompt the user to choose among the three and tailor the inference parameter setup based on that choice. WDYT?
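
A minimal sketch of that validation (hypothetical; the actual check in the PR may differ):

from pydantic import model_validator

_EXPECTED_PARAMS = {
    GenerationType.CHAT_COMPLETION: CompletionInferenceParameters,
    GenerationType.EMBEDDING: EmbeddingInferenceParameters,
}

class ModelConfig(BaseModel):
    # ... fields as above ...

    @model_validator(mode="after")
    def _check_inference_parameters(self):
        expected = _EXPECTED_PARAMS.get(self.generation_type)
        if expected is not None and not isinstance(self.inference_parameters, expected):
            raise ValueError(
                f"{type(self.inference_parameters).__name__} does not match "
                f"generation_type={self.generation_type.value!r}"
            )
        return self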

Updated in c6c29d4

@nabinchha (Contributor, Author) commented Dec 3, 2025:

Yep, in this case an EmbeddingColumn task could operate on either a string input or a list of strings (which can be represented by some internal data structure/type matching it). In the case of a list of strings, you get the embeddings of the list of strings, otherwise its the embedding of the single string.

It would be interesting in the future to consider the option of some kind of explode feature, like in pandas/polars. This would be entirely optional, but can allow a user to flatten the nesting of their dataset if that's not what they want. For instance, the below pattern would allow one to create a dataset of "all embedding chunks" from a set of source documents without needing to unpack or fiddle with nesting structures themselves.

The need to support generating multiple embeddings per row in a single generator exists exactly because, within the NDDL workflow, we don't yet have a way to explode and reduce rows. Until that becomes a reality, keeping the embedding generator simpler (operating on strings or lists of strings) without worrying about chunking should suffice and is generic enough. We can add chunking support in a different PR. Let me update this PR and incorporate your suggestions!

I removed the chunking param/logic in 06a724b. The embedding generator now expects the column it targets to contain a string or a stringified JSON list of strings.
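
For illustration, that target-column handling could look like this (a sketch; the helper name is made up):

import json

def _resolve_chunks(cell: str) -> list[str]:
    # Accept either a plain string or a stringified JSON list of
    # strings; anything else falls back to a single-chunk list.
    try:
        parsed = json.loads(cell)
    except (TypeError, ValueError):
        return [cell]
    if isinstance(parsed, list) and all(isinstance(c, str) for c in parsed):
        return parsed
    return [cell]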

@nabinchha (Contributor, Author) commented:

Closing this PR in favor of #106
